The aim of this document is to derive and present a method of multiple steps to be taken that convert the raw personal environmental data into an analysis ready dataset. For this I selected the data of 3 participants: ACT001D (very good data from visual inspection), ACT014F (some poor data), ACT032V (very poor data). The variables to be cleaned are temperature, relative humidity RH, and noise.
Main reasons for cleaning are:
The steps are the following with their relevance indicated with * (in my opinion):
First, data has to be excluded that was taken outside the observation window and during personal visit log times where the device was potentially taken down and charged. Both was done while compiling all the data to one data set and can be reviewed in the Week1_Data_Loop.R file. Below you can see the data from the 3 individuals with gaps where the pvls took place.
Every Variable (temperature, RH, noise) has its physical limits that the following:
# House
data_H <- data_H |>
mutate(IBH_TEMP_01 = if_else(IBH_TEMP < -273, 1, 0),
IBH_HUM_01 = if_else(IBH_HUM < 0 | IBH_HUM > 100, 1, 0))
# Worn
data_W <- data_W |>
mutate(IBW_TEMP_01 = if_else(IBW_TEMP < -273, 1, 0),
IBW_HUM_01 = if_else(IBW_HUM < 0 | IBW_HUM > 100, 1, 0))
# Taped
data_T <- data_T |>
mutate(IBT_TEMP_01 = if_else(IBT_TEMP < -273, 1, 0))
# Noise
data_N <- data_N |>
mutate(NS_01 = if_else(NS < 0, 1, 0))
No plots are shown here because there are no impossible values in the example data.
The plausible range is to some degree subjective, depends on the observation surroundings and changes not only depending on the variable, but also what the variable describes (temperature taped and house). Therefore now we need to start with device specific variable value ranges.
# House
data_H <- data_H |>
mutate(IBH_TEMP_02 = if_else(IBH_TEMP < 0 | IBH_TEMP > 55, 1, 0))
# Worn
data_W <- data_W |>
mutate(IBW_TEMP_02 = if_else(IBW_TEMP < 15 | IBW_TEMP > 45, 1, 0))
# Taped
data_T <- data_T |>
mutate(IBT_TEMP_02 = if_else(IBT_TEMP < 33 | IBT_TEMP > 39, 1, 0))
The variability between variables and devices differs significantly (eg. humidity house and worn). Because we are interested in stress experienced by the individuals, it is important to not filter out extreme but realistic conditions as these represent the largest stress impact. However we do want to filter out worn and taped measurements that resemble the variance of the house measurements and indicate the the device was not worn/taped. For the worn data, we use the moving standard deviation of 3 centered humidity values and for the taped data the same but from temperature values.
# Worn
data_W <- data_W |>
mutate(IBW_TEMP_07 = if_else(IBW_HUM_MSD < 0.75, 1, 0),
IBW_HUM_07 = if_else(IBW_HUM_MSD < 0.75, 1, 0))
# Taped
data_T <- data_T |>
mutate(IBT_TEMP_07 = if_else(IBT_TEMP_MSD < 0.05, 1, 0))
Important for the sequence analysis is to keep the measuring intervals in mind. For temperature and humidity this is 15 minutes while noise was measured every minute. Again, the feasibility of a variable changes depending on the variability itself and the object that is being observed (room and body temperature).
For this we need to come up with thresholds that should be exceeded from one value to the other, depending on the time interval, for temperature in 15 minutes. House and worn temperature can be very different but humidity should differ as the human body acts as a source of moisture. Noise can be very sporadic and intense and will not be filtered here.
# House
data_H <- data_H |>
mutate(IBH_TEMP_03 = if_else(abs(IBH_TEMP - lag(IBH_TEMP, default = first(IBH_TEMP))) > 3, 1, 0),
IBH_HUM_03 = if_else(abs(IBH_HUM - lag(IBH_HUM, default = first(IBH_HUM))) > 20, 1, 0))
# Worn
data_W <- data_W |>
mutate(IBW_TEMP_03 = if_else(abs(IBW_TEMP - lag(IBW_TEMP, default = first(IBW_TEMP))) > 6, 1, 0),
IBW_HUM_03 = if_else(abs(IBW_HUM - lag(IBW_HUM, default = first(IBW_HUM))) > 35, 1, 0))
# Taped
data_T <- data_T |>
mutate(IBT_TEMP_03 = if_else(abs(IBT_TEMP - lag(IBT_TEMP, default = first(IBT_TEMP))) > 3, 1, 0))
The logic behind this filter is that in an organic environment, nothing is exactly the same over time. Everything changes always ever so slightly. The absence of change indicates a malfunction (house observations) and mismeasurements (taped). This, however, could be disregarded as the exposure should be very close to the mismeasured value. Therefore this filter will be excluded for now but thresholds for deviance from the previous values should be at least the last digit of the available digits:
# WARNING Both outcomes are set to 0
# House
data_H <- data_H |>
mutate(IBH_TEMP_04 = if_else(abs(IBH_TEMP - lag(IBH_TEMP, default = first(IBH_TEMP))) < 0.00001, 0, 0),
IBH_HUM_04 = if_else(abs(IBH_HUM - lag(IBH_HUM, default = first(IBH_HUM))) < 0.001, 0, 0))
# Worn
data_W <- data_W |>
mutate(IBW_TEMP_04 = if_else(abs(IBW_TEMP - lag(IBW_TEMP, default = first(IBW_TEMP))) < 0.00001, 0, 0),
IBW_HUM_04 = if_else(abs(IBW_HUM - lag(IBW_HUM, default = first(IBW_HUM))) < 0.001, 0, 0))
# Taped
data_T <- data_T |>
mutate(IBT_TEMP_04 = if_else(abs(IBT_TEMP - lag(IBT_TEMP, default = first(IBT_TEMP))) < 0.00001, 0, 0))
# Noise
data_N <- data_N |>
mutate(NS_04 = ifelse(abs(NS - lag(NS, default = first(NS))) < 0.00001, 0, 0))
I suggest we should rather flag values that have not changed over a longer period of time like 8 intervals (2 hours). For this I utilized a backwards looking moving average.
# WARNING Both outcomes are set to 0
# House
data_H <- data_H |>
mutate(IBH_TEMP_05 = if_else(IBH_TEMP == IBH_TEMP_MA, 0, 0),
IBH_HUM_05 = if_else(IBH_HUM == IBH_HUM_MA, 0, 0))
# Worn
data_W <- data_W |>
mutate(IBW_TEMP_05 = if_else(IBW_TEMP == IBW_TEMP_MA, 0, 0),
IBW_HUM_05 = if_else(IBW_HUM == IBW_HUM_MA, 0, 0))
# Taped
data_T <- data_T |>
mutate(IBT_TEMP_05 = if_else(IBT_TEMP == IBT_TEMP_MA, 0, 0))
# Noise
data_N <- data_N |>
mutate(NS_05 = if_else(NS == NS_MA, 0, 0))
However, this leads to flagging of values after a slope change and
consequently is not a perfect method to detect unchanged values.
This correction investigates whether worn or taped devices were taken off which I believe is crucial in this data cleaning. For this, we use the house device as a reference. The issue here is that the timestamps are not exactly the same but can be shifted by up to 7 minutes and 30 seconds (due to the 15 min interval). Therefore, we will compare each taped and worn measurement with the measurement taken previously by the house device. As an indicator for this we will use only temperature as house and worn humidity can likely be very similar even with proper measurements.
The worn and taped temperature have to be within +- 0.1 °C of the house temperature to be flagged. Humidity measurements are flagged if the temperature measurements are flagged as they do no represent personal exposure anymore.
data_match_HW <- fuzzy_left_join(
data_H,
data_W,
by = "datetime",
match_fun = list(`<=`)
) |>
group_by(datetime.x) |>
slice_min(datetime.y) |>
ungroup() |>
mutate(Match_TEMP_HW = ifelse(IBW_TEMP < (IBH_TEMP + 0.1) & IBW_TEMP > (IBH_TEMP - 0.1), 1, 0)) |>
select(uid.x, datetime.x, IBH_TEMP, IBW_TEMP, IBW_HUM, Match_TEMP_HW)
data_match_HT <- fuzzy_left_join(
data_H,
data_T,
by = "datetime",
match_fun = list(`<=`)
) |>
group_by(datetime.x) |>
slice_min(datetime.y) |>
ungroup() |>
mutate(Match_TEMP_HT = ifelse(IBT_TEMP < (IBH_TEMP + 0.1) & IBT_TEMP > (IBH_TEMP - 0.1), 1, 0)) |>
select(uid.x, datetime.x, IBH_TEMP, IBT_TEMP, Match_TEMP_HT)
The final cleaning is the combination of all the above filters.
The most suitable methods for filtering are:
The following methods were less suitable because..
Study design limitation: only exclude data from PVLs if the corresponding device has been charged/changed
Physically plausible: lower threshold for worn temperature -> below 10 °C, threshold for taped temperature IQR
Variability: flag only values if 3 consecutive values have exceeded the threshold
Sequence: not useful
Final Plot: keep the the original unfiltered data in the background so that it is easier to evaluate the cleaning
Next steps: include 5 people, do the cleaning and show the final product, then calculate hourly averages and repeat the week1_summary code